CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders
Large-scale self-supervised pre-trained speech encoders outperform
conventional approaches in speech recognition and translation tasks. Because
developing these large models is costly, building new encoders for new tasks
and deploying them in on-device applications is often infeasible. Prior studies
propose model compression methods to address this issue, but those works focus
on smaller models and less realistic tasks. Thus, we propose Contrastive
Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to
compress pre-trained speech encoders by leveraging masked prediction and
contrastive learning to train student models to copy the behavior of a large
teacher model. CoLLD outperforms prior methods and closes the gap between small
and large models on multilingual speech-to-text translation and recognition
benchmarks.
Comment: Submitted to ICASSP 202
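
To make the training signal concrete, the objective can be pictured as an InfoNCE loss in which each masked student frame must identify the teacher's frame at the same position among all frames of the utterance. Below is a minimal sketch under simplified assumptions (one utterance, matched layer shapes, no projection heads); the paper's exact masking, layer mapping, and loss details differ.

```python
# Minimal sketch of a CoLLD-style contrastive layer-to-layer loss.
# `student_layer` comes from a student fed MASKED input; `teacher_layer`
# comes from the frozen teacher fed the CLEAN input. The shapes and the
# absence of projection heads are simplifying assumptions.
import torch
import torch.nn.functional as F

def colld_style_loss(student_layer, teacher_layer, masked_idx, temperature=0.1):
    """InfoNCE over frames: each masked student frame (query) must match
    the teacher frame at the same time index (positive) against all other
    teacher frames in the utterance (negatives).

    student_layer, teacher_layer: (T, D) frame representations.
    masked_idx: (M,) indices of the masked frames.
    """
    s = F.normalize(student_layer[masked_idx], dim=-1)  # (M, D) queries
    t = F.normalize(teacher_layer, dim=-1)              # (T, D) keys
    logits = s @ t.T / temperature                      # (M, T) similarities
    # The correct "class" for query m is its own time index.
    return F.cross_entropy(logits, masked_idx)

# Toy usage: sum this loss over matched (student, teacher) layer pairs.
T, D = 200, 768
masked_idx = torch.randint(0, T, (40,))
loss = colld_style_loss(torch.randn(T, D), torch.randn(T, D), masked_idx)
```
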
Exploring Speech Enhancement for Low-resource Speech Synthesis
High-quality and intelligible speech is essential to text-to-speech (TTS)
model training; however, obtaining high-quality data for low-resource languages
is challenging and expensive. Applying speech enhancement to an Automatic Speech
Recognition (ASR) corpus mitigates the issue by augmenting the training data,
but how the nonlinear speech distortion introduced by speech enhancement models
affects TTS training still needs to be investigated. In this paper, we train a
TF-GridNet speech enhancement model and apply it to low-resource datasets that
were collected for the ASR task, and then train a discrete-unit-based TTS model
on the enhanced speech. We use Arabic datasets as an example and show that the
proposed pipeline significantly improves the low-resource TTS system over
baseline methods in terms of the ASR word error rate (WER). We also present an
empirical analysis of the correlation between speech enhancement and TTS
performance.
Comment: Submitted to ICASSP 202
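
The pipeline described above is straightforward to picture: enhance the noisy ASR recordings, discretize the enhanced speech, and train the TTS model on text-unit pairs. Below is a minimal sketch; `enhance` and `extract_units` are hypothetical stand-ins for the trained TF-GridNet model and a discrete-unit extractor (e.g., HuBERT features plus k-means), not the paper's actual code.

```python
# Minimal sketch of the enhance-then-train data pipeline. The two model
# handles are hypothetical placeholders, not real library calls.
import torch

def build_tts_corpus(asr_corpus, enhance, extract_units):
    """Turn a noisy ASR corpus into (transcript, unit-sequence) pairs for
    discrete-unit TTS training.

    asr_corpus: iterable of (waveform, sample_rate, transcript) triples.
    enhance: speech enhancement model (TF-GridNet-style); its output may
        carry nonlinear distortion, which is what the paper analyzes.
    extract_units: discretizer, e.g. SSL features + k-means clustering.
    """
    pairs = []
    for wav, sr, transcript in asr_corpus:
        clean = enhance(wav, sr)           # denoised/dereverberated speech
        units = extract_units(clean, sr)   # discrete unit sequence
        pairs.append((transcript, units))
    return pairs

# Toy usage with identity/constant stand-ins for the two models:
corpus = [(torch.randn(1, 16000), 16000, "example transcript")]
pairs = build_tts_corpus(corpus,
                         enhance=lambda w, sr: w,
                         extract_units=lambda w, sr: [12, 7, 7, 93])
```

At synthesis time, a unit-based vocoder maps the predicted units back to waveforms.
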
Textless Speech-to-Speech Translation on Real Data
We present a textless speech-to-speech translation (S2ST) system that can
translate speech from one language into another and can be built
without the need for any text data. Unlike existing work in the
literature, we tackle the challenge of modeling multi-speaker target speech and
train the systems with real-world S2ST data. The key to our approach is a
self-supervised unit-based speech normalization technique, which finetunes a
pre-trained speech encoder with paired audios from multiple speakers and a
single reference speaker to reduce the variations due to accents, while
preserving the lexical content. With only 10 minutes of paired data for speech
normalization, we obtain an average gain of 3.2 BLEU when training the S2ST model
on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized
target speech. We also incorporate automatically mined S2ST data and show an
additional 2.0 BLEU gain. To our knowledge, we are the first to establish a
textless S2ST technique that can be trained with real-world data and works for
multiple language pairs. Audio samples are available at
https://facebookresearch.github.io/speech_translation/textless_s2st_real_data/index.html .
Comment: Accepted to NAACL 2022 (long paper)
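
One way to read the normalization step: the encoder is finetuned so that, for any speaker's rendition of an utterance, it emits the discrete units the reference speaker would have produced, removing speaker and accent variation while keeping lexical content. Below is a minimal sketch of one such finetuning step, assuming a CTC objective over a unit vocabulary; `encoder` and `ctc_head` are hypothetical handles, and the paper's actual training setup may differ in detail.

```python
# Minimal sketch of a speech-normalization finetuning step. Input: a
# multi-speaker utterance; target: the unit sequence extracted from the
# reference speaker saying the same content. All module handles are
# hypothetical placeholders.
import torch
import torch.nn.functional as F

def normalization_step(encoder, ctc_head, src_feats, ref_units, optimizer):
    feats = encoder(src_feats)                      # (T, D) frame features
    log_probs = ctc_head(feats).log_softmax(-1)     # (T, V) over unit vocab
    loss = F.ctc_loss(
        log_probs.unsqueeze(1),                     # (T, N=1, V)
        ref_units.unsqueeze(0),                     # (N=1, S) target units
        input_lengths=torch.tensor([log_probs.size(0)]),
        target_lengths=torch.tensor([ref_units.size(0)]),
    )                                               # blank id defaults to 0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stand-ins (unit vocab of 100, unit ids 1..99, blank 0):
enc = torch.nn.Linear(80, 256)    # stand-in for a pre-trained encoder
head = torch.nn.Linear(256, 100)  # small CTC projection head
opt = torch.optim.Adam(list(enc.parameters()) + list(head.parameters()))
loss = normalization_step(enc, head, torch.randn(50, 80),
                          torch.randint(1, 100, (20,)), opt)
```
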
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals
translate speech between any two languages? While recent breakthroughs in
text-based models have pushed machine translation coverage beyond 200
languages, unified speech-to-speech translation models have yet to achieve
similar strides. More specifically, conventional speech-to-speech translation
systems rely on cascaded pipelines that perform translation progressively,
putting high-performing unified systems out of reach. To address these gaps, we
introduce SeamlessM4T, a single model that supports speech-to-speech
translation, speech-to-text translation, text-to-speech translation,
text-to-text translation, and automatic speech recognition for up to 100
languages. To build this, we used 1 million hours of open speech audio data to
learn self-supervised speech representations with w2v-BERT 2.0. Subsequently,
we created a multimodal corpus of automatically aligned speech translations.
After filtering this corpus and combining it with human-labeled and
pseudo-labeled data, we developed
the first multilingual system capable of translating from and into English for
both speech and text. On FLEURS, SeamlessM4T sets a new standard for
translations into multiple target languages, achieving an improvement of 20%
BLEU over the previous SOTA in direct speech-to-text translation. Compared to
strong cascaded models, SeamlessM4T improves the quality of into-English
translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in
speech-to-speech. Tested for robustness, our system performs better against
background noises and speaker variations in speech-to-text tasks compared to
the current SOTA model. Critically, we evaluated SeamlessM4T for gender bias and
for added toxicity to assess translation safety. Finally, all contributions in this
work are open-sourced and accessible at
https://github.com/facebookresearch/seamless_communication
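
To make the "single model, five tasks" framing concrete, the sketch below shows what such a unified interface looks like, in contrast to a cascade of separate ASR, MT, and TTS systems. All names here are hypothetical placeholders; the actual models and inference code live in the repository linked above.

```python
# Conceptual sketch of a unified multi-task translation interface. The
# class and its placeholder outputs are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    text: Optional[str] = None     # translated or recognized text, if any
    audio: Optional[bytes] = None  # synthesized target speech, if any

class MultitaskTranslator:
    """One shared model serves all five tasks; only the input and output
    modalities change, not the underlying parameters (unlike a cascade)."""

    TASKS = {"s2st", "s2tt", "t2st", "t2tt", "asr"}
    SPEECH_OUT = {"s2st", "t2st"}

    def predict(self, task: str, src, src_lang: str, tgt_lang: str) -> Result:
        if task not in self.TASKS:
            raise ValueError(f"unknown task: {task}")
        # Placeholder: a real model encodes `src` (speech or text), decodes
        # target text, and, for *-to-speech tasks, vocodes discrete units.
        text = f"[{src_lang}->{tgt_lang} output text]"
        audio = b"<waveform>" if task in self.SPEECH_OUT else None
        return Result(text=text, audio=audio)

# Toy usage: the same object serves every direction and modality.
model = MultitaskTranslator()
out = model.predict("s2tt", src=b"<speech>", src_lang="fr", tgt_lang="en")
```
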